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Abstract 

This paper describes the use of cluster analysis 
for determining the information model within col- 
lected software engineering development data at the 
NASA/GSFC Software Engineering Laboratory. We 
describe the Software Management Environment tool 
that allows managers to predict development at- 
tributes during early phases of a software project and 
the modifications we propose to allow it to develop dy- 
namic models for better prediction of these attributes. 

Keywords: Cluster analysis; Data modeling; Mea- 
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1 Introduction 

Software management depends upon managers to 
collect accurate data of the software development pro- 
cess and on the production of accurate models upon 
which to use that. data. Lines of code is still the most 
widely used measure for cost and error analysis, even 
though it is known to be inaccurate [8]. However, since 
it is not known until the completion of a project, its 
use as a predictive measure is not reliable. What are 
needed are more accurate models of the software de- 
velopment process. 

Current models are developed according to broad 
categories, such as waterfall development, spiral model 
development, cleanroom development, etc., witli addi- 
tional qualifiers giving a few attributes of the product 
(e.g., real time, embedded application, data base). 
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Data is often collected and projects are compared to 
historical baselines according to these general cate- 
gories. For example, the COCOMO model [1] is based 
upon a small set of predefined factors, and predictions 
are made according to how a new project measures up 
to these factors. 

It is difficult for software managers, however expe- 
rienced they are, to evaluate the status or quality of a 
software development project and make correct deci- 
sions without accurate, reliable measurement models 
and data. These data include metrics aimed at clari- 
fying and quantifying some quality of either a software 
product, or the development process itself [13]. 

Since we do not have accurate models of the soft- 
ware development process, perhaps, we can use the 
data itself to develop dynamic models of software 
development that reflect the changing nature of the 
development process. In this paper we study one 
particular modeling technique, cluster analysis, as 
a means for determining the underlying information 
model present in the collected software engineering de- 
velopment data. 

The importance of software management has led, 
to the development of various software management 
tools for aiding in this effort. These tools help soft- 
ware managers get access to, visualize, and analyze 
measurement data. The Software Management Envi- 
ronment (SME) is one of those tools developed within 
the NASA Goddard Space Flight Center Software En- 
gineering Laboratory (SEL) [6], [12]. It is the purpose 
of this paper to investigate the use of cluster analysis 
within SME to enhance the ability of software man- 
agers to predict and control the software development 
process. 

In Section 2 we describe the information model and 
the measures used by SME. In Section 3 we describe 
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our use of cluster analysis to dynamically change our 
information model, and in Section 4 we describe some 
preliminary results of using our new model. We then 
give our conclusions to this work. 


2 Measurement in SME 

For over 15 years the software engineering commu- 
nity has been studying various models of the soft- 
ware development process. Concepts like Halstead’s 
software science measures, Putnam’s Rayleigh curve, 
Boehm's COCOMO model, among many others, are 
all attempts at providing a quantitative model un- 
derlying the software development cycle. Unfortu- 
nately, most of these models are very general, and 
while broadly describing the software process, do not 
have the granularity to make accurate predictions on 
a single software project. 

As a way to further these studies, the Software En- 
gineering Laboratory was established to evaluate the 
above models and develop new models within a pro- 
duction programming environment. 


2.1 NASA/GSFC SEL 

The NASA Goddard Space Flight Center Software 
Engineering Laboratory is a joint research project 
of GSFC Flight Dynamics Division, Computer Sci- 
ences Corporation and the University of Maryland. 
Data from over 100 projects has been collected since 
1976 and a data base of over 50 Mbytes of measure- 
ment data has been developed. Initially supporting 
100,000 line FORTRAN ground support software for 
unmanned spacecraft written by 10 to 15 program- 
mers over a 2 year period for an IBM mainframe, the 
SEL data base now includes a wider variety of projects 
consisting also of Ada and C code for a variety of ma- 
chines. 

The SEL collects data both manually and automat- 
ically. Manual data includes effort data (e.g., time 
spent by programmers on a variety of tasks - design, 
coding, testing), error data (e.g., errors or changes, 
and the effort to find, design and make those changes), 
and subjective and objective facts about projects (e.g., 
start and end completion dates, goals and attributes 
of project and whether they were met). Automatically 
collected data includes computer use, program static 
analysis, and source line and module counts. 


2.2 Measure Models 

Data modeling often combines various measures in 
order to evaluate attributes in a software development. 
For example, classification trees were used as part 
of the Amadeus project [9] [10] and a variant of that 
method was used within the SEL [11]. In this case, a 
tree is generated where each leaf node represents one 
of several results. Based upon values for each mea- 
sure, a path down the tree is taken until a result at a 
leaf node is reached. 

For each project, we can compare the collected data 
over time with a predefined model of a similar project 
from the data base. A basic measure model refers to 
the expected behavior of a software development mea- 
sure as a function of time [5]. Measures, developed 
from the raw data collected by the SEL, include lines 
of code, staff hours, computer hours, and changes and 
errors. A measure model is usually obtained by exam- 
ining the data for that measure over a set of projects 
and averaging them. Time is described in terms of 
the four major phases of software development within 
the waterfall life-cycle: design, code and unit test, sys- 
tem test, and acceptance test. 1 Measure behavior is 
described in terms of percent completion of that mea- 
sure at each distinct checkpoint. 

Within the SEL, we describe one of these measure 
models as a vector of 15 points, each representing the 
percent completion of the measure at distinct dates 
in the development cycle (generally 25% increments 
through each phase). Table 1 shows the tabular repre- 
sentation of a Lines of Code (LOC) model [5] and Fig- 
ure 1 shows the graphical representation of the same 
model. According to the LOC model, no code should 
be written during the design phase, and most of the 
code (76%i) should be written during the code and unit 
test phase. 

For ease of use, we can use the vector representation 
of the model: 

[0,0,0,0,0,6.86,36.05,53.99,76.28, 

86.82, 94.88, 96.09, 98.14, 99.58, 100] 

In general, a measure model can be represented by 
the following vector: 

p = [PO,Pl>P2,-.,Pl3,PL4] 

*The SEL does not collect specification data since that task 
is performed by another group. This is reflected in the models 
that the SEL develops, and is a good indication why no two 
development models are easily transportable across locations. 
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Phase 

% of Phase 

% of Total Lines 

Design 

0 

0.00 


25 

0.00 


50 

0.00 


75 

0.00 

Code/ Unit 

0 

0.00 

Test 

25 

6.86 


50 

36.05 


75 

53.99 

System Test 

0 

76.28 


50 

86.82 

Acceptance 

0 

94.88 

Test 

25 

96.09 


50 

98.14 


75 

99.58 

End 

100 

100.00 


Table 1: Tabular representation of a LOC model 



Figure 1: Graphical representat ion of a LOC model 
with po = 0, 0 < pi < 100 for 1 < ? < 13, and 

pi 4 = 100 . 

We will use measure pattern to refer to a measure 
model derived from a single project. Essentially, we 
produce a measure model as the average of some set 
of measure patterns. 

2.3 SME 

The Software Management Environment was devel- 
oped to help software managers carry out manage- 
ment activities like observation, comparison, predic- 
tion, analysis, and assessment [6]. In order to provide 
these functions, SME uses measurement data from 
current and past projects from the SEL database, re- 



Figure 2: LOC patterns 


search results in terms of models and relationships, 
and manager experience from the past. 

SME was initially built with a fixed set of measure 
models. For example, for LOC (lines of code), the 
most apparent predictor seemed lo be programming 
language. Therefore, SME originally had two models 
of LOC based upon language - Ada and FORTRAN. 
Each project was classified according to the measure 
model it was expected to adhere to, and for each mea- 
sure type, a predefined measure model was stored in 
the data base. 

Some of the features of SME are described below. 

Measure models in SME 

Currently in SME, a measure model is derived from 
a set of projects with the same characteristics, such 
as development methodology, programming language, 
and development environment. SME decides which 
measure model to use for a project measure of inter- 
est based on the characteristics of that project. For 
example, Figure 2 shows four LOC patterns of four 
different projects with the same characteristics. SME 
creates a LOC model by averaging these patterns, but 
is the resulting model a good representative of actual 
LOC behavior? This is the basic question behind our 
research plan, and our goal is to develop, dynami- 
cally, LOC (and other) models that better represent 
attribute behavior. 

Observation and Comparison 

To monitor the progress of a project, managers need 
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Figure 3: Growth in ’Lines of Code’ for Pi 


cumulative growth data for measures such as effort, 
size and errors. SME provides graphic display of the 
actual collected data like shown in Figure 3, in which 
the solid curve represents an overall view of project 
Pi's growtli in size (lines of code) over a specified cal- 
endar time. The dotted curve in Figure 3 shows a 
LOC model of a similar project or the LOC measure 
model from the data base to permit the manager to 
compare project data to a model which indicates the 
“normal behavior” for such projects. Comparison can 
also be made between projects. 

Prediction 

SME can also predict a measure’s completion value 
for an on-going project, by using the appropriate mea- 
sure model scaled up to the actual time schedule of the 
new project. Using the initial data collected from a 
project, final values can be estimated giving the man- 
ager an indication of the measure’s possible future be- 
havior. 

Analysis and Assessment 

SME can help the manager identify the probable 
causes of any unexpected behavior for a measure, and 
assess the quality of a project based on all the mea- 
surement. data. For each measure, a knowledge base of 
cause-effect relationships is maintained. So, for exam- 
ple, if a given project seems to have too many errors 
at a certain point in the coding phase compared to the 
error measure model, a rationale can be provided to 
the manager, such as: 


TEAM IS REPORTING INCONSEQUENTAIL ERRORS 

INEXPERIENCED DEVELOPMENT TEAM 

POOR USE OF METHODOLOGY 

COMPLEX PROBLEM 

etc . 

Similar idea can be found in [4]. What is desired 
is a mechanism whereas this knowledge base can be 
updated dynamically as projects evolve. 

3 Cluster analysis 

Cluster analysis is the technique for finding groups 
in data [7] that represent the same information model. 
Biologists and social scientists have long used it to 
analyze their data. Here, we use it to find similar 
measure patterns within the collected software devel- 
opment data. 

Clustering was used previously in an early SEL 
study [3] in order to determine possible patterns in 
projects by clustering the modules that make up the 
project. The results were somewhat inconclusive due 
to large variances within small modules and the many 
different attributes that contributed to the single value 
that was clustered. In this current study, we try to 
separate out different attributes and study their ef- 
fects over time. This gives greater precision to the 
data we are looking at and eliminates much of the 
variability found in the earlier study. 

3.1 Clustering 

As stated in section 2.2, a measure pattern can be 
represented by a vector. Clustering is a method to 
determine which vectors are similar and represent the 
same or similar physical objects. There are several 
clustering and modeling algorithms, including: 

• Euclidean distance. Each vector represents a 
point in n-space. Points near one another are in 
the same cluster. 

• Cosine. Each measure pattern represents a vector 
from the origin. The cosine of the angle between 
two vectors represents the similarly in their com- 
ponents and hence their closeness. 

• Optimal Set Reduction. OSR generates, based on 
search algorithms and univariate statistics, logical 
expressions which represent strong patterns in a 
data set [2]. 
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Figure 4 : A cluster of LOG patterns 

Several other algorithms also have been used. 

For our initial investigation, we are using the Eu- 
clidean distance between two vectors as a degree 
of similarity between two measure patterns. For 
example, if P — [po, Pi . P2> ■ * ■ iPi3i Pw] and ;V = 
[rig j n i , n?, . . • , ni3, 77 1 4] are two measure patterns, 
then their Euclidean distance is 

ed(P, N ) = \/(Po - n o) 2 + ■ ■ ■ + (p\a - »m) 2 

Two patterns are assumed similar and are in the same 
cluster if and only if ed(P, N) < £. 

Note that by varying e we can adjust the size of 
the clusters by specifying how close two vectors must 
be in order to be in the same grouping. Since single 
vector clusters provide no information, we want to ad- 
just e so that we generally have clusters of at least 3 
vectors without including vectors that represent fun- 
damentally different curves. Figure 4 shows a cluster 
of three LOC patterns. 

3.2 Cluster model 

A cluster model is the average of all measure pat- 
terns in one cluster. It closely describes the measure 
behavior for all projects in the cluster because mea- 
sure patterns in the same cluster are similar. Instead 
of choosing a predefined measure model for a project 
measure of interest using the project’s characteristics 
(as is currently the case with SME), a cluster model 
can be dynamically selected for the project measure 
depending on which cluster its pattern best fits. 


A further advantage from the current static ap- 
proach of SME, is that alternative models can be de- 
veloped for each measured attribute. Within SME, 
the same measure model is used for all measured at- 
tributes. For example, if the defining characteristic is 
Ada for the LOC measure, it will be the Ada mea- 
sure model for each other measure (e.g., error, efTort). 
With dynamic clustering, measure models can vary for 
each distinct measure. 

For an ongoing project, a manager’s estimate of 
schedule and measure completion values are used to 
derive its measure patterns. Estimates are replaced 
by real data once they become available. So a project 
measure’s closest cluster model may change as the 
project develops. In Section 4.3 we discuss how to use 
this information to improve on the predictive capa- 
bilities of SME. On the other hand, since a project s 
development methodology or programming language 
usually do not change during a project’s development 
life-cycle, the static measure model chosen by the cur- 
rent implementation of SME based on those charac- 
teristics does not change. 

Similarly, SME does an assessment of a project V 
real data compared to the measure model's estimate 
by use of a predefined set of attributes. But by look- 
ing at the attributes that are common for all projects 
within a given cluster, we may be able to determine 
general characteristics for any new project that falls 
within that cluster. This list of attributes will dy- 
namically evolve over time instead of being a static 
description of project behavior. For example, if all 
projects within a given cluster were previously late in 
delivery, it may be useful to report this information 
to the manager of a new project that falls within this 
cluster. 

This allows the knowledge base to grow and change 
dynamically as projects develop. It does not require 
the predefinition of a few models - which may not even 
accurately represent the actual development model, 
only a manager’s poor estimate of one. 

4 Evaluation of Clustering 

Before implementation of our clustering approach 
within SME, we evaluated the effectiveness of clus- 
tering with a subset of the SEL data base. Mea- 
surement data from twenty-four projects in the data 
base were clustered using eight different measures: 
computer hours (CPU), total staff hours (EFF), lines 
of code (LOC), modules changed (MCI1), module 
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Figure 5: Two clusters of MCH patterns 

count (MOD), reported changes (RCH), reported er- 
rors (RER), and computer jobs (RUN). We then stud- 
ied common objective and subjective attributes of 
projects in the same cluster. 

For example, Figure 5 shows two clusters of MCII 
(module changes) patterns. Cluster C\ consists of pat- 
terns from projects P 3 , Pi 3 and Pi 9 , and duster Co 
consists of patterns from projects Pg, Pi e , ^20 We 
observe that more than half of the module changes 
were made during the code and unit test phase for 
projects in Co compared to about twenty precent for 
projects in Ci . Consequently, only twenty percent of 
the module changes were made during the system test 
phase for Go compared to about fifty percent for Ci. 

4.1 Objective characteristics 

Project characteristics of the two clusters are suin- 
maried in Table 2 and 3 respectively. We observe that 
if computer language is the basis for choosing a MCH 
measure model, as is the case with the current ver- 
sion of SME, all six projects will use the same MCH 
model since they all use FORTRAN. In this case, clus- 
tering discovers the two vastly different behaviors of 
MCH measures which are undetectable with the static 
approach. 

In addition, some commonly used discriminators do 
not appear to be significant with these clusters. Size is 
often used to classify projects, yet cluster C \ contains 
projects from 16K to 179K source lines. The projects 
represent two very different hardware and software en- 
vironments (IBM mainframe and DEC VAX VMS) 
and each project in C\ represents a different applica- 


Attributes 

P 3 

Pl3 

P 19 

Computer 

IBM 

DEC 

IBM 

Language 

FORT. 

FORT. 

FORT. 

Application 

AGSS 

SIM. 

ORBIT 

Reuse (%) 

10.1 

30.7 

38.1 

Time (wks) 

116 

119 

109 

Size (SLOC) 

178.6 

36.6 

15.5 


Table 2: Project characteristics for cluster C\ 


Attributes 

P 6 

Pi 6 

7*20 

Computer 

mmu 

IBM 

IBM 

Language 

FORT. 

■an™ 

FORT. 

Application 



AGSS 

Reuse (%) 


1.9 

10.0 

Time (wks) 


87 

147 

Size (SLOC) 


233.8 

295.4 


Table 3: Project characteristics for cluster Ci 


tion area. (However projects in Co are more homoge- 
neous; they all represent relatively large 168K to 296K 
attitude ground support systems 1 mill as mainframe 
IBM applications.) 

4.2 Subjective characteristics 

Subjective data for each project is stored in the 
data base as an integer between 1 (low or poor) and 
5 (high). Each project manager fills in these values 
at the end of a project based upon experiences during 
the development. For each cluster we retrieved those 
subjective attributes that differed by at most 1 within 
the cluster, thus indicating a common feature for those 
clustered projects. This information can then he fed 
back to the manager of a new project that falls within 
that cluster to provide an indication of probable future 
behavior. 

Projects in cluster C 2 have common ratings on the 
following subjective attributes: 

Tightness of schedule constraints : 3 
Access to development system: 3 
Timely software delivery: 4 

We notice that their rating for timeliness of software 
delivery is relatively high. This could be a direct re- 
sult of the fact that most module changes were made 
during code and unit test phase. 
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Figure 6: Prediction for CPU 


4.3 Predictive models 

The two clusters of Figure 5 are easiest to mea- 
sure when all data points for each measure model are 
available. However, it is the very nature of predic- 
tive models that some of this data is incomplete. We 
are currently altering SME’s predictive capabilities to 
take this into account. 

If data is available for new project P up through 
point, i (e.g., values for po,Pi, ■ then clustering 

for P against each existing cluster will be only with 
respect to these i T 1 points. That is, for each cluster 
C, it will be assumed that p l and a have the same 
value and P’s other values will be scaled accordingly. 
Clustering will determine winch cluster has the closest 
shape to P’s shape. 

Once a matching cluster is found, it will be as- 
sumed that project P has the same characteristics as 
this found cluster and the succeeding values for P will 
match the cluster’s measure model for points i T 1 
through 14. 

The effect will be to scale P’s original estimate with 
respect to the cluster’s estimate. For example, in Fig- 
ure 6, if the cluster estimated a 50% completion by 
point 8 and the actual data showed a 75% “comple- 
tion,” then it can be assumed that the actual com- 
pletion will be 150% since the relevant cluster is only 
half finished. In this case it can be assumed that the 
manager underestimated the resources needed for this 
project. We are currently modifying SME’s graphical 
interface to show these predicted curves. 


The predictive model for project P depends upon 
both estimating the total resources needed in order to 
compute the percentage for point P, and estimating 
the schedule in order to determine how far one has 
progressed in the current development phase. Either 
one, however, may not be accurate. For example, cur- 
rent point Pq represents 50% coding, yet that is only 
known when coding is complete. The current date 
may possibly range from perhaps the 25% level (and 
lienee really represent point P 5 ) to the 75% level (and 
hence really represent point Pj) depending upon how 
accurately the initial schedule was set up. The true 
date will be known only after the coding phase is com- 
pleted. However, in the above paragraphs we have de- 
scribed a mechanism to estimate resource needs when 
we assume that the schedule is correct. 

On the other hand, if the latest available project 
point Pi is scaled to a cluster model horizontally along 
phases instead of vertically (i.e., by changing the es- 
timated schedule), we can predict future changes in 
project schedule. However, since only discrete mile- 
stones of a schedule are used, they need to be quan- 
tified before numerical scaling can be applied. We 
are looking at extending the SME predictive model in 
order to estimate both the resource needs as well as 
potential bounds on the schedule based upon current 
data. 

It should be realized that the model’s predictive 
capabilities improve as a project develops. Very few 
points are available for prediction early in the develop- 
ment cycle leading to few differences among the var- 
ious clusters. On the other hand, late in the devel- 
opment cycle where there is more variability among 
the clusters, it may be too late to change develop- 
ment models to account for any potential problems. 
IIow well the early predictions lead to significant dif- 
ferences in project development attributes is obviously 
an issue we need to investigate. 

4.4 Model evaluation 

Aside from its primary use as a tool to aid man- 
agement in predicting future behavior on a current 
software development project, use of cluster analysis 
permits SME to be used as a tool to evaluate new 7 
models. If a model is proposed that describes some 
attribute of development that is collected by the SEL 
data base, then all projects within a cluster should 
exhibit that attribute to a great extent. 

For example, the SEL is currently planning to en- 
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hance the SEL data base with additional predefined 
measure models in addition to the two models used 
at present. Often the following attributes (and their 
relevant values at NASA) are viewed as important at- 
tributes of a development methodology: 

• Computer' use - IBM or DEC environment 

• Reuse of existing source code - Low, medium or 
high reuse of existing source code 

• Language - FORTRAN or Ada as a source pro- 
gramming language 

• Methodology - Cleanroom or standard NASA wa- 
terfall development method 

By choosing one value from each category, the SEL can 
develop 24 possible models. A subset of these will be 
built into the SEL data base as predefined models for 
each project and each project will be assigned to one of 
these categories. However, while they are often viewed 
as crucial attributes, are these really discriminators 
useful to differentiate among projects? 

If these are really discriminators of project devel- 
opment, then projects within a single cluster should 
all consist of the same predefined measure model (or 
at least predominately so). We can then use our clus- 
tering approach to determine the effectiveness of the 
new proposed models. 

We can also use clustering to determine if there 
are any relationships among measures. If a cluster for 
Reported Change (RCH) consists of the same projects 
as a cluster for Reported Error (RER), this indicates 
that, those two measures are closely related. If projects 
A and B are in the same cluster for CPU, LOG and 
RUN, then those projects are somewhat related. 

This approach can be extended to any quantitative 
model. Projects in the data base can be grouped ac- 
cording to how well they meet the discriminators of 
any new proposed measure. The projects can be clus- 
tered, and if the models are appropriate, then clusters 
should be somewhat homogeneous. 

For example, cleanroom is a technique that ad- 
dresses early verification of a design that should result 
in fewer resulting errors (with less testing necessary) 
later in the development cycle. If so, then measur- 
ing reported errors (RER) per computer run (RUN) 
should cluster cleanroom projects together, and the 
plots should show high measure model values early in 


the development cycle. We can use SME to test such 
claims from this and other proposed models. 

4.5 Evaluation of clustering 

Clustering is effective in distinguishing measure 
behaviors. For most of the measures studied, we 
were able to yield clusters that differentiated behavior 
among the projects, whereas the current SME would 
consider them all similar and use the same measure 
model on that data. 

A current weakness, however, is that the result- 
ing clusters yield few common objective or subjective 
characteristics. We believe that this is due more to the 
nature of the current subjective files within the SEL 
data base than in the clustering methodology itself. 
The current data files are developed by the project 
managers and contain attributes about the project 
(e.g., external events such as schedule and require- 
ments changes, team composition, environment com- 
position). There is little about how management was 
performed (e.g., we didn’t test enough, we started cod- 
ing too soon). This is understandable given how the 
data was collected. We need to develop methods to 
collect this latter data in a non-threatening manner 
from each project manager so that it can be fed back 
to future project managers more effectively. 

5 Conclusion 

In this paper, clustering is presented as a mech- 
anism for dynamically determining and altering the 
information model that describes certain attributes of 
the software development process. This permits the 
software manager to more accurately predict, the fu- 
ture behavior of a given project based upon similar 
characteristics of existing projects in a data base. We 
believe the resulting cluster models are fairly accurate 
indicators of such behavior. 

Clustering also permits rationale for deviations 
from normal behavior to be determined dynamically 
and are easier to generate than the existing expert 
system approach. Preliminary evaluation of cluster- 
ing leads us to believe that the resulting models are 
fairly accurate indicators of such behavior 

In addition, it appears that some often used dis- 
criminators may not be totally effective in classify- 
ing projects. Size, programming environment and ap- 
plication domain may unnecessarily separate projects 
into categories that are ultimately the same (e.g., see 
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Tables 2 and 3). Obviously, this needs further study. 

We are in the process of modifying NASA/GSFC’s 
SME management tool for incorporation of these new 
models into the tool. We believe that this should 
greatly improve SME’s predictive capabilities. Mod- 
ification of the data in the SEL subjective data files 
should greatly aid in the analysis and assessment as- 
pects of SME. 

However, the process is far from over. We also 
intend to study alternative clustering and modeling 
techniques (e.g., Optimal Set Reduction, Cosine) in 
order to determine the best approach towards measur- 
ing these critical attributes. In addition, we need to 
observe how well early predictions of a project match 
with subsequent observations in order to be able to 
use SME as an effective management planning and 
tracking tool. 
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